Abstract

The systematics of indices of physico-chemical properties of codons and amino acids across 
the genetic code are examined. Using a simple numerical labelling scheme for nucleic acid bases, 
A = (-1,0), C = (0,-1), G = (0,1), U = (1,0), data can be fitted as low-order polynomials
of the 6 coordinates in the 64-dimensional codon weight space. The work confirms and extends 
the recent studies by Siemion et al. (1995) (BioSystems 36, 63-69) of the conformational pa- 
rameters. Fundamental patterns in the data such as codon periodicities, and related harmonics 
and reflection symmetries, are here associated with the structure of the set of basis monomials 
chosen for fitting. Results are plotted using the Siemion one-step mutation ring scheme, and 
variants thereof. The connections between the present work, and recent studies of the genetic 
code structure using dynamical symmetry algebras, are pointed out. 
1 Introduction and main results



Fundamental understanding of the origin and evolution of the genetic code (Osawa et al., 1991) 
must be grounded in detailed knowledge of the intimate relationship between the molecular 
biochemistry of protein synthesis, and the retrieval from the nucleic acids of the proteins' stored 
design information. However, as pointed out by Lacey and Mullins in 1983, although 'the nature 
of an evolutionary biochemical edifice must reflect ... its constituents, . . . properties which were 
important to prebiotic origins may not be of relevance to contemporary systems'. Ever since 
the final elucidation of the genetic code, this conviction has led to many studies of the basic 
building blocks themselves, the amino acids and the nucleic acid bases. Such studies have sought 
to catalogue and understand the spectrum of physico-chemical characteristics of these molecules, 
and of their mutual correlations. The present work is a contribution to this programme. 

Considerations of protein structure point to the fundamental importance of amino acid hy- 
drophilicity and polarity in determining folding and enzymatic capability, and early work (Woese 
et al., 1966; Volkenstein, 1966; Grantham, 1974) concentrated on these aspects; Weber and 
Lacey (1978) extended the work to mono- and di-nucleosides. Jungck (1978) concluded from a 
compilation of more than a dozen properties that correlations between amino acids and their
corresponding anticodon dinucleosides were strongest on the scale of hydrophobicity/hydrophilicity
or of molecular volume/polarity. For a comprehensive review see Lacey and Mullins (1983). 

Subsequent to this early work, using statistical sequence information, conformational indices 
of amino acids in protein structure have been added to the data sets (Goodman and Moore, 
1977). Recently Siemion (1994a, 1994b) has considered the behaviour of these parameters across 
the genetic code, and has identified certain periodicities and pseudosymmetries present when 
the data is plotted in a certain rank ordering called 'one-step mutation rings', being generated 
by a hierarchy of cyclic alternation of triplet base letters (Siemion and Stefanowicz, 1992). The
highest level in this hierarchy is the alternation of the second base letter, giving three major 
cycles based on the families U, C and A, each sharing parts of the G family. The importance of 
the second base in relation to amino acid hydrophilicity is in fact well known (Weber and Lacey, 
1978; Lacey and Mullins, 1983; Taylor and Coates, 1989), and the existence of three independent 
correlates of amino acid properties, again associated with the U, C and A families, has also been
statistically established by principal component analysis (Sjostrom and Wold, 1985). 

Given the existence of identifiable patterns in the genetic code in this sense, it is of some 
interest and potential importance to attempt to describe them more quantitatively. Steps along 
these lines were taken in 1995 by Siemion et al. With a linear rank ordering of amino acids 
according to 'mutation angle' $\pi k/32$, $k = 1, \ldots, 64$ along the one-step mutation rings, the
parameter $P^\alpha$ was reasonably approximated by trigonometric functions which captured the essential
fluctuations in the data.

The metaphor of a quantity such as the Siemion number k, allowing the genetic code to be 
arranged in a way which best reflects its structure, in analogy with elemental atomic number
Z and the chemical periodic table, is an extremely powerful one. The present paper takes 
up this idea, but in a more flexible way which does not rely on a single parameter. Instead, 
a natural labelling scheme is used which is directly related to the combinatorial fact of the 
triplet base codon structure of the genetic code, and the four-letter base alphabet. Indeed, any 
bipartite labelling system which identifies each of the four bases A, C, G, U, extends naturally to 
a composite labelling for codons, and hence amino acids. We choose for bases two coordinates 
as A = (-1,0), C = (0,-1), G = (0,1), U = (1,0), so that codons are labelled as ordered
sextuplets, for example ACG = (-1, 0, 0, -1, 0, 1).
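
As a minimal illustration of this labelling (a sketch only: the names `BASE_LABELS` and `codon_label` are introduced here for convenience and are not taken from the original work), the sextuplet of any codon can be assembled by concatenating the coordinates of its three bases:

```python
from itertools import product

# Two-coordinate (d, m) labels for the four RNA bases, as chosen in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def codon_label(codon):
    """Concatenate the (d, m) labels of each base into one flat tuple."""
    coords = []
    for base in codon:
        coords.extend(BASE_LABELS[base])
    return tuple(coords)


# All 64 codons, in plain alphabetical order (not the mutation-ring order).
ALL_CODONS = ["".join(bases) for bases in product("ACGU", repeat=3)]

print(codon_label("ACG"))
print(len({codon_label(c) for c in ALL_CODONS}))
```

Because the four base labels are distinct, the 64 codons receive 64 distinct sextuplets, so any codon property can be regarded as a function on this set of points.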

In quantitative terms, any numerical indices of amino acid or codon properties, of physico-chemical
or biological nature, can then be modelled as some functions of the coordinates of
this codon 'weight space'. Because of other algebraic approaches to the structure of the genetic 
code, we take polynomial functions (for simplicity, of as low order as possible). This restriction 
does not at all exclude the possibility of periodicities and associated symmetry patterns in the 
data. In fact, as each of the six coordinates takes discrete values 0, ±1, appropriately chosen 
monomials can easily reproduce such effects (with coefficients to be fitted which reflect the 
relative strengths of various different 'Fourier' components). Quite simply, the directness of the 
linear rank ordering, as given by the Siemion number k, which suggests Fourier series analysis of 
the data, is here replaced by a more involved labelling system, but with numerical data modelled 
as simple polynomial functions. 

The main results of our analysis are as follows. In §2 the labelling scheme for nucleic acid
bases is introduced, leading to $4^N$-dimensional 'weight spaces' for length-$N$ RNA strands: in
particular, 16-dimensional for $N = 2$, and 64-dimensional for the sextuplet codon labelling
($N = 3$). For $N = 2$ the dinucleoside hydrophilicity, dinucleotide hydrophobicity, and free energy
of formation of 2-base RNA duplices are considered. Displayed as linear plots (or bar charts)
on a ranking from 1 to 16, the data have obvious symmetry properties, and corresponding basis
monomials are identified, resulting in good fits. Only four coordinates are involved for these
16-part data sets (see table 1). Moving on in §3 to codon properties as correlated to those
of amino acids, the Siemion number $k$, which establishes amino acid ranking by mutation angle, is
briefly reviewed. It is shown that the trigonometric approximation of Siemion et al. (1995)
to the Chou-Fasman conformational parameters $P^\alpha$ (Chou and Fasman, 1975;
Fasman, 1989) is effectively a four-parameter function which allows for periodicities of 32/5,
8, 32/3 and 64 codons. Again, simple basis monomials having the required elements of the
symmetry structure of $P^\alpha$ are identified, leading to a reasonable (four-parameter) fit. Results
are displayed as Siemion mutation-angle plots. $P^\beta$ is treated in a similar fashion. The method
established in §§2 and 3 is then applied in §4 to other amino acid properties, including relative
hydrophilicity (Weber and Lacey, 1978) and Grantham polarity (Grantham, 1974). It is clearly
shown that appropriate polynomial functions can be fitted to most of them (the amino acid data are
summarised in tabular form).

In §5 some concluding remarks, and an outlook for further development of these ideas, are given.
It is emphasised that, while the idiosyncrasies of real biology make it inappropriate to regard this
type of approach as anything but approximate, nonetheless there may be some merit in a more
rigorous follow-up to establish our conclusions in a statistically valid way. This is particularly
interesting in view of the appendix, §A. This gives a brief review of algebraic work based on
methods of dynamical symmetries in the analysis of the excitation spectra of complex systems
(such as atoms, nuclei and molecules), which has recently been proposed to explain the origin
and evolution of the genetic code. Specifically, it is shown how the labelling scheme adopted
in the paper arises naturally in the context of models based either on the Lie superalgebra
$A_{5,0} \simeq sl(6/1)$, or the Lie algebra $B_6 \simeq so(13)$, or related semisimple algebras. The origins
and nature of the polynomial functions adopted in the paper, and generalisations of these, are
also discussed in the algebraic context. The relationship of the present paper to the dynamical
symmetry approach is also sketched in §5 below.

2 Codon systematics 

Ultimately our approach involves a symmetry between the 4 heterocyclic bases U,C,A,G com- 
monly occurring in RNA. A logical starting point then, is to consider the physical properties of 
small RNA molecules. Dinucleosides and dinucleotides in particular are relevant in the informational
context of the genetic code and anticode, and moreover are the building blocks for larger
nucleic acids (NA's). What follows in this section is a numerical study of some properties of 
NA's consisting of 2 bases, while in later sections NA's with 3 bases (i.e. codons and anticodons) 
are considered in the context of the genetic code as being correlated with properties of amino 
acids. 

As mentioned in the introduction, we give each NA base coordinates in a two-dimensional
'weight space', namely A = (-1,0), C = (0,-1), G = (0,1), U = (1,0) with the axes labelled
$d$, $m$ respectively.* Dinucleosides and dinucleotides are therefore associated with four coordinates
$(d_1, m_1, d_2, m_2)$, e.g. AC = (-1,0,0,-1), with subscripts referring to the first and second
base positions.

The physical properties of nucleic acids we choose to fit are the relative hydrophilicities
$R_f$ of the 16 dinucleoside monophosphates as obtained by Weber and Lacey (1978), the relative
hydrophobicity $R_x$ of dinucleotides as calculated by Jungck (1978) from the mononucleotide
data of Garel et al. (1973), and the 16 canonical (Crick-Watson) base-pair stacking
parameters of Xia et al. (1998) used to compute the free heat of formation $G^0_{37}$ of
duplex RNA strands at 37°C.

It should be noted that the dinucleoside quantity $R_x$ was computed as the product of experimentally
derived $R_x$ values for mononucleotides, under the assumption that this determines the
true dinucleotide values to within 95%. A result of this is that $R_x$ is automatically the same
for dinucleotides 5'-XY-3' and 5'-YX-3' (naturally the same holds for molecules with
the reverse orientation); thus this symmetry of $R_x$ is at best approximate.

The 16 Turner free-energy parameters are a subset of a larger number of empirically determined
thermodynamic "rules of thumb" (see Xia et al., 1998; Mathews et al., 1999, for the
most recent results), developed to predict free heats of formation of larger RNA and DNA
molecules. The possibility that these rules have an underlying group-theoretical structure is a
consideration for future work. For now it suffices to observe that, due to geometry (see the
accompanying table), the duplex formed by 5'-XY-3' with 3'-$\bar{X}\bar{Y}$-5' is just a rotated version of the duplex
5'-$\bar{Y}\bar{X}$-3' with 3'-YX-5'. Here $\bar{X}$ denotes the Crick-Watson complementary base to X.
Furthermore the duplices formed by so-called "self-complementary" dinucleosides (5'-GC-3'
with 5'-CG-3' and 5'-AU-3' with 5'-UA-3') are thermodynamically suppressed due
to the extra rotational symmetry (i.e. there are two ways such a duplex can form) and one needs
to include extra monomials to compensate for this. It is easy to see that the above rotational
symmetry corresponds to the change of coordinates

$$(d_1, m_1, d_2, m_2) \to (-d_2, -m_2, -d_1, -m_1) \tag{1}$$

and thus we need look only at monomials which respect this symmetry, for example $m_1 - m_2$,
$d_1 d_2$ and $(d_1 m_2 + d_2 m_1)$. The least-squares fit to the most recent values of the Turner parameters
(Xia et al., 1998) is shown in the accompanying figure and is given by

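$$\begin{aligned}
-G^0_{37}(d_1, m_1, d_2, m_2) ={}& 1.133 + 0.02\big((m_1 + d_1) m_2 + d_2 (m_1 - d_1)\big) + 1.001\,(m_1^2 + m_2^2) \\
& - 0.1\, d_1 d_2 + 0.035\,(d_1 - d_2) + 0.1\,(m_1 - m_2) \\
& + 0.165\, m_1 m_2 (m_2 - m_1 + 1) + 0.0225\, d_1 d_2 (d_2 - d_1 - 1).
\end{aligned} \tag{2}$$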

Here we have considered all linear and quadratic terms respecting the symmetry of Eq. (1) and
added cubic symmetry-breaking terms which are specific to the self-complementary duplices. 
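
The fitted function of Eq. (2) can be evaluated directly on the base labels. The following sketch transcribes the coefficients as they are read here from Eq. (2) (an assumption wherever the printed equation is hard to read) and checks numerically how the values behave under the rotation of Eq. (1), that is, under replacing XY by the complement of YX. The names `minus_g37`, `rotated` and `COMPLEMENT` are illustrative only.

```python
from itertools import product

# (d, m) labels of the bases, as adopted in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}
# Crick-Watson complements; in (d, m) coordinates this is a sign flip.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def minus_g37(first, second):
    """Evaluate the right-hand side of Eq. (2) for the dinucleoside 5'-XY-3'."""
    d1, m1 = BASE_LABELS[first]
    d2, m2 = BASE_LABELS[second]
    return (
        1.133
        + 0.02 * ((m1 + d1) * m2 + d2 * (m1 - d1))
        + 1.001 * (m1**2 + m2**2)
        - 0.1 * d1 * d2
        + 0.035 * (d1 - d2)
        + 0.1 * (m1 - m2)
        + 0.165 * m1 * m2 * (m2 - m1 + 1)
        + 0.0225 * d1 * d2 * (d2 - d1 - 1)
    )


def rotated(first, second):
    """Apply Eq. (1): (d1, m1, d2, m2) -> (-d2, -m2, -d1, -m1)."""
    return COMPLEMENT[second], COMPLEMENT[first]


# Largest change in the fitted value under the rotation, over all 16 pairs.
max_change = max(
    abs(minus_g37(x, y) - minus_g37(*rotated(x, y)))
    for x, y in product("ACGU", repeat=2)
)
print(f"max change under Eq. (1): {max_change:.1e}")

# Fitted values for the four self-complementary dinucleosides.
for pair in ("GC", "CG", "AU", "UA"):
    print(pair, round(minus_g37(*pair), 3))
```

Evaluating the fit on all 16 dinucleosides in this way gives the values to be compared with the tabulated Turner parameters.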

*The choice (±1, ±1) and (±1, ∓1) for the four bases simply represents a 45° rotation of the adopted scheme,
which turns out to be more convenient for our purposes. The nonzero labels at each of the four base positions
are given by the mnemonic 'diamond'.



The number of monomials may be reduced with the identities:

$$d_i^2 = 1 - m_i^2, \tag{3}$$
$$m_i d_i = 0. \tag{4}$$
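
These reductions follow from the fact that each base has exactly one nonzero coordinate, equal to ±1. A short check over the four base labels is sketched below (the function name `check_identities` is illustrative; the relations $d^3 = d$ and $m^3 = m$ are a further consequence of the same fact, added here as an observation rather than taken from the text):

```python
# (d, m) labels of the bases, as adopted in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def check_identities():
    """Return True if every base label satisfies the reduction identities."""
    for d, m in BASE_LABELS.values():
        if d * m != 0:             # the product of the two coordinates vanishes
            return False
        if d**2 + m**2 != 1:       # equivalently d^2 = 1 - m^2
            return False
        if d**3 != d or m**3 != m:  # odd powers collapse to the first power
            return False
    return True


print(check_identities())
```

In practice this means that, at each base position, only the monomials 1, $d$, $m$ and one of $d^2$ or $m^2$ need be retained.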

Encouraged by this success we attempt a similar fit to $R_f$ and $R_x$. While there is no obvious
underlying symmetry a priori as in the previous discussion, one might expect these properties
to be anti-correlated and so the same set of monomials is considered for each. Qualitatively
faithful fits may be obtained using a small number of monomials, as shown in the accompanying figure. The
functions

$$R_f(d_1, m_1, d_2, m_2) = 0.191 - 0.087\,(d_1^2 + d_2^2) + 0.09\, d_1 + 0.107\, d_2 - 0.053\, m_1 - 0.077\, m_2, \tag{5}$$
$$R_x(d_1, m_1, d_2, m_2) = 0.3278 + 0.093\,(d_1^2 + d_2^2) - 0.1814\,(d_1 + d_2) + 0.0539\,(m_1 + m_2) \tag{6}$$

are seen to compare favourably to the experimental values, and moreover $R_f$ and $R_x$ are
roughly anti-correlated. Thus fitting to these using the same set of monomials seems to be a
valid procedure in this initial approximation.
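
The degree of anti-correlation between the two fitted functions can be quantified over the 16 dinucleosides. The sketch below uses the coefficients as read here from Eqs. (5) and (6) (an assumption where the print is unclear) and computes a plain Pearson coefficient; it compares the two fits with each other, not with the experimental data, and the helper names are illustrative.

```python
from itertools import product
from math import sqrt

# (d, m) labels of the bases, as adopted in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def r_f(d1, m1, d2, m2):
    """Fitted dinucleoside hydrophilicity, coefficients as read from Eq. (5)."""
    return (0.191 - 0.087 * (d1**2 + d2**2) + 0.09 * d1 + 0.107 * d2
            - 0.053 * m1 - 0.077 * m2)


def r_x(d1, m1, d2, m2):
    """Fitted dinucleotide hydrophobicity, coefficients as read from Eq. (6)."""
    return (0.3278 + 0.093 * (d1**2 + d2**2) - 0.1814 * (d1 + d2)
            + 0.0539 * (m1 + m2))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Four coordinates (d1, m1, d2, m2) for each of the 16 dinucleosides.
labels = [BASE_LABELS[x] + BASE_LABELS[y] for x, y in product("ACGU", repeat=2)]
rf_values = [r_f(*lab) for lab in labels]
rx_values = [r_x(*lab) for lab in labels]
correlation = pearson(rf_values, rx_values)
print(round(correlation, 2))
```

A strongly negative coefficient here only reflects the opposite signs of corresponding terms in the two fits; a statement about the measured quantities would require the tabulated experimental values.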

3 Amino acid conformational parameters 

As a case study for amino acid properties (as opposed to their correlated codon properties
in §2 above) we consider the structural conformational parameters $P^\alpha$ and $P^\beta$, which have
been discussed by Siemion (1994a, 1994b). In 1995 Siemion et al. introduced a quantity $k$,
$k = 1, \ldots, 64$, which defined the so-called 'mutation angle' $\pi k/32$ for a particular assignment
of codons (and hence of amino acids) in rank ordering. This is a ramification of the four-ring
ordering used above for plots (expanded from 16 to 64 points), and arises from a certain hierarchy
of one-step base mutations. It assigns the following k values to the NN'Y and NN'R codons†

[Table of k values for the NN'Y and NN'R codons; only the entries GUY and AUY survive.]

wherein (as in the 'four ring' scheme) the third base alternates as ... - G,A - U,C - C,U -
A,G - ... for purine-pyrimidine occurrences ... - R - Y - Y - R - ... . This 'mutation ring'
ordering corresponds to a particular trajectory around the diamond-shaped representation of
the genetic code, which is pictured in the corresponding figures (Siemion, 1994a), where nodes have been
labelled by amino acids.

Inspecting the trends of assigned $P^\alpha$ values for the amino acids ordered in this way, a
suggestive 8-codon periodicity, and a plausible additional $C_2$ rotation axis about a spot in the
centre of the diagram, have been identified (Siemion, 1994a, 1994b). The corresponding figure gives various fits to
this data, as follows. Firstly, consideration of the modulation of the peaks and troughs of the

† Individual codons are labelled so that these Y, R positions are at the midpoints of their respective k intervals.
Thus GGR occupies 0 < k < 2, with nominal k = 1 and codons k(GGA) = 0.5, k(GGG) = 1.5.



period-8 component, on either side of the centre at k = 0, leads to a trigonometric function
(Siemion et al., 1995),

$$P^\alpha(k) \approx 1.0 - \left[0.32 + 0.12\cos\left(\frac{\pi k}{16}\right)\right]\cos\left(\frac{\pi k}{4}\right) - 0.09\sin\left(\frac{\pi k}{32}\right), \tag{7}$$

where the parameters are estimated simply from the degree of variation in their heights (and
0.44 = 0.32 + 0.12 is the average amplitude). Least-squares fitting of the same data in fact leads
to a similar function,

$$P^\alpha(k) \approx 1.02 - \left[0.22 + 0.21\cos\left(\frac{\pi k}{16}\right)\right]\cos\left(\frac{\pi k}{4}\right) + 0.005\sin\left(\frac{\pi k}{32}\right). \tag{8}$$

From the point of view of Fourier series, however, the amplitude modulation of the codon period-8
term in Eq. (7) or Eq. (8) merely serves to add extra beats of period 32/5 and 32/3 of equal weight
(0.06 in the case of Eq. (7)); an alternative might then be to allow different coefficients. This gives instead the fitted
function

$$P^\alpha(k) \approx 1.02 - 0.22\cos\left(\frac{\pi k}{4}\right) - 0.11\cos(\cdots) - 0.076\cos(\cdots) \tag{9}$$

which has no $\sin(\pi k/32)$ term, but is almost indistinguishable from equation (8) above (note that
0.22 + 0.21 ≈ 0.22 + 0.11 + 0.07 ≈ 0.32 + 0.12 = 0.44). In the figure the data is displayed as
a histogram along with the fitted functions above; as can be seen, the fits show similar trends, and
have difficulty in reproducing the data around the first position codons of the C family in
the centre of the diagram (see caption to figure 5).
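
The origin of the two beat periods can be made explicit with the product-to-sum identity. The following worked step assumes that the modulating and carrier arguments of the period-8 term are $\pi k/16$ and $\pi k/4$ respectively, as the stated periods 32/5 and 32/3 require, and uses the modulation coefficient 0.12 of the estimated function as the example:

```latex
\begin{align}
\cos\frac{\pi k}{16}\,\cos\frac{\pi k}{4}
  &= \frac{1}{2}\left[\cos\frac{5\pi k}{16} + \cos\frac{3\pi k}{16}\right], \\
0.12\,\cos\frac{\pi k}{16}\,\cos\frac{\pi k}{4}
  &= 0.06\,\cos\frac{2\pi k}{32/5} + 0.06\,\cos\frac{2\pi k}{32/3}.
\end{align}
```

Together with the unmodulated $\cos(\pi k/4)$ term of period 8 and the $\sin(\pi k/32)$ term of period 64, this accounts for the four periodicities 32/5, 8, 32/3 and 64 codons.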

Basing the systematics of the genetic code on numerical base labels, as advocated in the
present work, a similar analysis to the above trigonometric functions is straightforward, but now
in terms of polynomials over the six codon (i.e. trinucleotide) coordinates $(d_1, m_1, d_2, m_2, d_3, m_3)$.
There is no difficulty in establishing basic 8-codon periodic functions; combinations such as
$a\,d_3 - b\,m_3$ (with values $-a$, $-b$, $+b$, $+a$ on A, G, C, U, for suitable fractions $a$, $b$), or more simply the perfect Y/R discriminator
$d_3 - m_3$ (with values -1, +1 on R, Y respectively) can be assumed. Similarly, terms such
as $d_1 \pm m_1$ have period 16, and $d_2 \pm m_2$ have period 64. The required modulation of the 8-codon
periods can also be regained by including in the basis functions for fitting a term such as $d_2^2$,
and finally an enhancement of the C ring family boxes GCN, CCN is provided by the cubic
term $m_1 m_2 (m_2 - 1)$. The resulting least-squares fitted function is

$$P^\alpha_6(d_1, m_1, d_2, m_2, d_3, m_3) = 0.86 + 0.24\, d_2^2 + 0.21\, m_1 m_2 (m_2 - 1) - 0.02\,(d_3 - m_3) - 0.075\, d_2^2 (d_3 - m_3) \tag{10}$$

and is plotted against the $P^\alpha$ data in the corresponding figure. The resulting fit* is rather insensitive to the
weights of $d_3$ and $m_3$ (allowing unconstrained coefficients in fact results in identical weights
+0.02 for the linear terms and -0.064, +0.085 for the $d_2^2$ coefficients respectively). It should be
noted that, despite much greater fidelity in the C ring, $P^\alpha_6$ shows similar features to the least-squares
trigonometric fits of Eqs. (8) and (9) in reproducing the 8-codon periodicity less clearly than Eq. (7)
(see figure 5). This indicates either that the minimisation is fairly shallow at the fitted functions
(as suggested by the fact that Eqs. (8) and (9) differ by less than ±0.01 over one period), or that a
different minimisation algorithm might yield somewhat different solutions. To show the possible
range of acceptable fits, a second polynomial is displayed in the figure, whose $d_2^2 (d_3 - m_3)$ coefficient

* In contrast to the trigonometric fits, which are only intended to fit the data for specified codons (indicated
by the dots in the figure), the least-squares fit is applied for the polynomial functions to all 64 data points. See
Siemion et al. (1995) and the captions to the corresponding figures.



is chosen as -0.2 rather than -0.075. This function plays the role of the original estimate of Eq. (7)
in displaying a much more pronounced eight-codon periodicity than allowed by the
least-squares algorithm.
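
To make the role of the individual monomials concrete, the polynomial fit can be evaluated codon by codon. The sketch below takes the coefficients as read here from Eq. (10) (an assumption where the printed equation is unclear) and exposes the coefficient of the modulating term as a parameter, so that the least-squares value -0.075 and the alternative -0.2 can be compared; the names `p_alpha_6` and `mod_coeff` are illustrative.

```python
# (d, m) labels of the bases, as adopted in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def p_alpha_6(codon, mod_coeff=-0.075):
    """Polynomial fit to the helix parameter, coefficients as read from Eq. (10).

    mod_coeff multiplies d2^2 (d3 - m3), the term that modulates the
    8-codon (third-base Y/R) oscillation on the A and U families.
    """
    (_, m1), (d2, m2), (d3, m3) = (BASE_LABELS[base] for base in codon)
    y_minus_r = d3 - m3  # +1 for a pyrimidine third base, -1 for a purine
    return (
        0.86
        + 0.24 * d2**2
        + 0.21 * m1 * m2 * (m2 - 1)
        - 0.02 * y_minus_r
        + mod_coeff * d2**2 * y_minus_r
    )


for codon in ("GCU", "CCU", "AAA", "AAU"):
    print(codon, round(p_alpha_6(codon), 3), round(p_alpha_6(codon, -0.2), 3))
```

The cubic term is nonzero only for second base C, where it separates the GCN and CCN boxes, while changing `mod_coeff` alters only the codons whose second base is A or U.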

The nature of the eight-codon periodicity is related to the modulation of the conformational
status of the amino acids through the R or Y nature of their third codon base (Siemion and
Siemion, 1994). A sharper discriminator of this is the difference $P^\alpha - P^\beta$, which suggests that
a more appropriate basis for identifying numerical trends is with $P^\alpha - P^\beta$ (the helix-forming
potential) and $P^\alpha + P^\beta$ (generic-structure-forming potential). Although we have not analysed
the data in this way, this is indirectly borne out by separate fitting (along the same lines as
above) of $P^\beta$, for which no significant component of $(d_3 - m_3)$ is found. A typical five-parameter
fit, independent of third base coordinate, is given by

$$P^\beta_6(d_1, m_1, d_2, m_2, d_3, m_3) = 1.02 + 0.26\, d_2 + 0.09\, d_2^2 - 0.19\, d_2 (d_1 - m_1) - 0.1\, d_1 m_2 (m_2 - 1) - 0.16\, m_1^2 m_2 (m_2 - 1). \tag{11}$$

The corresponding figure shows that this function does indeed average over the third base Y/R fluctuations
evident in the A family data. A major component appears to be the dependence on $(d_1 - m_1)$,
that is, on the Y/R nature of the first codon base, responsible for the major peaks and troughs
visible on the A and U rings (and reflected in the $d_2 (d_1 - m_1)$ term). The cubic and quartic
terms follow the modulation of the data on the C ring.

The suggested pseudosymmetries of the conformational parameters are important for trigonometric
functions of the mutation angle, and for polynomial fits serve to identify leading monomial
terms with simple properties. The $d_2 (d_1 - m_1)$ term in the fit of $P^\beta$ above has been noted already
in this connection. In the case of $P^\alpha$, it should be noted that an offset of 2 codons in
the position of a possible $C_2$ rotation axis (from k = 34, between ACY and ACR, to k = 32,
after GCY) changes the axis from a pseudosymmetry axis (minima coincide with maxima after
rotation) to a true symmetry axis (as the alignment of minima and maxima is shifted by four
codons), necessitating fitting by a period-eight component which is even about k = 32. At the
same time the large amplitude changes in the C ring appear to require an odd function, and are
insensitive to whether the $C_2$ axis is chosen at k = 32 or k = 34. The terms in $P^\alpha_6$ above have
just these properties.



4 Other amino acid properties 

In this section we move from the biologically-measured conformational parameters to biochemical 
indices of amino acid properties. Two of the most significant of these are the Grantham polarity 
(Grantham, 1974) and the relative hydrophilicity as obtained by Weber and Lacey (1978). 
Variations in chemical reactivity have been considered (Siemion and Stefanowicz, 1992), but are 
not modelled here. 

The composite Grantham index incorporates weightings for molecular volume and molecular
weight, amongst other ingredients (Grantham, 1974). From figure 10 it is evident that a major
pattern is a broad 16-codon periodicity (indicative of a term linear in $d_2$). Additional smaller
fluctuations coincide approximately with the 8-codon periodicity of the Y/R nature of the third
base ($d_3 - m_3$ dependence). Although there is much complex variation due to the first base, in
the interests of simplicity, the following fitted function ignores this latter structure, and provides
an approximate (2-parameter) model (see figure 10):

$$G_6(d_1, m_1, d_2, m_2, d_3, m_3) = 8.298 - 2.716\, d_2 - 0.14\,(d_3 - m_3). \tag{12}$$
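
The structure of this two-parameter model is easily seen by averaging it over each second-base family. The sketch below uses the coefficients as read here from Eq. (12) (an assumption where the print is unclear); the names `g6` and `family_mean` are illustrative.

```python
from itertools import product

# (d, m) labels of the bases, as adopted in the text.
BASE_LABELS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def g6(codon):
    """Two-parameter polarity model, coefficients as read from Eq. (12)."""
    d2, _ = BASE_LABELS[codon[1]]
    d3, m3 = BASE_LABELS[codon[2]]
    return 8.298 - 2.716 * d2 - 0.14 * (d3 - m3)


def family_mean(second_base):
    """Average of the model over the 16 codons sharing a second base."""
    values = [g6(first + second_base + third)
              for first, third in product("ACGU", repeat=2)]
    return sum(values) / len(values)


for base in "UCGA":
    print(base, round(family_mean(base), 3))
```

The third-base term averages to zero within each family, so the family means differ only through the term linear in $d_2$: the U and A families lie symmetrically below and above the common value taken by the C and G families.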



The pattern of amino acid hydrophilicity is also seen to possess an 8-codon periodicity. The
4-parameter fitted function considered, which is plotted in the corresponding figure, is:

$$R_{f6}(d_1, m_1, d_2, m_2, d_3, m_3) = 0.816 - 0.038\, d_2 - 0.043\, m_2 + 0.022\,(d_3 - m_3) + 0.034\,(1 - d_2)\, d_2 (d_3 - m_3). \tag{13}$$

As with the case of Grantham polarity, the 8-period extrema might be more 'in phase' with
the data if codons were weighted according to usage, after the approach of Siemion et
al. (1995).

5 Conclusions and outlook 

In this paper we have studied codon and amino acid correlations across the genetic code starting 
from the simplest algebraic labelling scheme for nucleic acid bases (and hence RNA or DNA 
strands more generally). The relationship between the rank ordering of amino acids according 
to Siemion number k, k = 1,...,64, and a description of codons based on 3 dichotomic labels,
has been established (see the corresponding figures). In §2 several dinucleoside properties have been fitted as
quadratic polynomials of the labels, and §§3 and 4 have considered amino acid parameters as
correlated to codons (trinucleotides), namely conformational parameters, Grantham polarity and 
hydrophilicity. The types of data considered for fitting in our approach include strictly physical 
information (amino acid molecular weight and volume), physico-chemical indices (for example, 
the semi-empirical indicators of dinucleoside free energy of formation, and the composite amino 
acid Grantham polarity), as well as biological measures (such as the conformational parameters, 
which are logarithmic measures of amino acid usage in structural protein elements). As pointed 
out in the introductory discussion, all of these measures should be considered as important 
aspects in the 'optimisation' of the genetic code (see also the remarks in the appendix, §A). In
all cases acceptable algebraic fitting is possible, and various patterns and periodicities in the 
data are readily traced to the contribution of specific monomials in the least-squares fit. 

As pointed out in the appendix, our algebraic approach is a special case of more general 
dynamical symmetry schemes in which measurable attributes H are given as combinations of 
Casimir invariants of certain chains of embedded Lie algebras and superalgebras (Bashford et
al., 1997, 1998; Hornos and Hornos, 1993; Schlesinger et al., 1998; Schlesinger and Kent, 1999;
Forger et al., 1997). The identification by Jungck (1978) of two or three major characters, to
which all other properties are strongly correlated, would similarly in the algebraic description
mean the existence of two or three distinct 'master' Hamiltonians $H_1, H_2, H_3, \ldots$ (possibly
with differing branching chains). In themselves these could be abstract and need not have a
physical interpretation, but all other properties should be highly correlated to them,

$$K = a_1 H_1 + a_2 H_2 + a_3 H_3. \tag{14}$$

Much has been made of the famous redundancy of the code in providing a key to a group-theoretical
description (Hornos and Hornos, 1993; Forger et al., 1997). In the present framework
(see also Bashford et al., 1997, 1998), codon degeneracies take second place to major features
such as periodicity and other systematic trends across the genetic code. Thus for example the
noted 8-codon periodicity of the conformational parameter allows the Y codons for k = 25,
UCY, and k = 63, AGY, both to be consistent with ser (as the property attains any given
value twice per 8-codon period, at Y/R box k = 24 + 1 = 25, and again 4 periods later at the
alternative phase k = 56 + 7 = 63).



A related theme is the reconstruction of plausible ancestral codes based on biochemical 
and genetic indications of the evolutionary youth of certain parts of the existing code. For 
example the anomalous features of arginine (arg), which suggest that it is an 'intruder', have led
(Jukes, 1973) to the proposal of a more ancient code using ornithine (orn) instead. This has
been supported by the trigonometric fit to $P^\alpha$ (Siemion et al., 1995; Siemion and Stefanowicz,
1996), as the inferred parameters for orn actually match the fitted function better than arg at
the k = 43, k = 61 CGR, AGR codons. Such variations could obviously have some influence on
the polynomial fitting, but at the present stage have not been implemented.

To the extent that the present analysis has been successful in suggesting the viability of an 
algebraic approach, further work with the intention of establishing Eq. (14) in a statistically reliable
fashion may be warranted. What is certainly lacking to date is any microscopic justification 
for the application of the techniques of dynamical symmetry algebras (but see Bashford et al., 
1997,1998). However, it can be considered that in the path to the genetic code, the primitive 
evolving and self-organising system of information storage and directed molecular synthesis 
has been subjected to 'optimisation' (whether through error minimisation, energy expenditure, 
parsimony with raw materials, or several such factors). If furthermore the 'space' of possible 
codes has the correct topology (compact and convex in some appropriate sense), then it is not 
implausible that extremal solutions, and possibly the present code, are associated with special 
symmetries. It is to support the identification of such algebraic structures that the present 
analysis is directed. 

After this work was completed, we received a paper (Frappat et al., 2000) which gives a
similar analysis of dinucleotide properties and correlations between physical-chemical properties
of amino acids and codons based on a particular algebraic scheme (see also Frappat et al., 1998).
It should be emphasized that comparison of such analyses based merely on the number of fitted
parameters is not particularly illustrative at this preliminary stage. One could modify, for
example, Eq. (2) by cubic or quadratic transformations with the intent to minimise the number
of parameters, but our motivation is to employ physical symmetry properties in an intuitive
way. The number of parameters reflects the fact that our analysis has no prior commitment to
any given abstract algebraic scheme. Indeed reproducing the fitted $G^0_{37}$ of Frappat et al. (2000)
requires a judicious choice of cubic monomial terms.
A Appendix: Dynamical symmetry algebras and genetic code structure



The radical proposal of Hornos and Hornos (1993) to elucidate the genetic code structure using
the methods of dynamical symmetry algebras drew attention to the relationship of certain
symmetry-breaking chains in the Lie algebra $C_3 \simeq sp(6)$ to the fundamental degeneracy patterns
of the 64 codons. This theme has been taken up subsequently using various different Lie
algebras (Schlesinger et al., 1998; Schlesinger and Kent, 1999; Forger et al., 1997) and also Lie
superalgebras (Bashford et al., 1997, 1998; Forger and Sachse, 1998; Sorba and Sciarrino, 1998). In
addition to possible insights into the code redundancy, a representation-theoretical description
also leads to a code elaboration picture whereby evolutionarily primitive, degenerate assignments
of many codons to a few amino acids and larger symmetry algebras gave place, after symmetry
breaking to subalgebras, to the incorporation of more amino acids, each with fewer redundant
codons.

In (Bashford et al., 1997, 1998; Jarvis and Bashford, 1998) emphasis was given not to the
patterns of codon redundancy as such, but rather to biochemical factors which have been recognised
as fundamental keys to be incorporated in any account of evolution from a primitive coding
system to the present universal one. Among these factors is the primacy of the second base letter
over the first and third in correlating with such basic amino acid properties as hydrophilicity
(Woese et al., 1966; Volkenstein, 1966). Also, the partial purine/pyrimidine dependence of
the amino acid assignments within a family box further underlines the informational content
of the third codon base (Siemion and Siemion, 1994) and necessitates a symmetry description
which distinguishes the third base letter. In (Bashford et al., 1998) the amino acid degeneracy
was replaced by the weaker condition of anticodon degeneracy, leading to a Lie superalgebra
classification scheme using chains of subalgebras of $A_{5,0} \simeq sl(6/1)$ (see below for details).

A concomitant of any representation-theoretical description of the genetic code is the 'weight 
diagram' mapping the 64 codons to points of the weight lattice (whose dimension is the rank 
of the algebra chosen). Reciprocally, the line of reasoning advocated above and applied in 
(Bashford et al., 1998) to the case of Lie superalgebras suggests that any description using 
dynamical symmetry algebras must be compatible with the combinatorial fact of the four-letter 
alphabet, three-letter word structure of the code. The viewpoint adopted in the present paper 
is to explore the implications of generic labelling schemes of this type, independently of the 
particular choice of algebra or superalgebra. In particular, as pointed out above, the weight
diagram is supposed to arise from labelling each of the three base letters of the codon alphabet 
with a pair of dichotomic variables. Thus the only technical structural requirement for Lie 
algebras and superalgebras compatible with the present work is the existence of a 6-dimensional 
maximal abelian (Cartan) subalgebra, and of 64-dimensional irreducible representations whose 
weight diagram has the geometry of a six-dimensional hypercube in the weight lattice. (The
relationship between the base alphabet and the $Z_2 \times Z_2$ Klein four-group has been discussed
by Bertman and Jungck (1979).) As examples of a Lie algebra and a Lie superalgebra with this
structure, we here take the case of $B_6 \sim so(13)$ and $A(5,0) \sim sl(6/1)$ respectively (other examples
would be $so(4)^3$, $sl(2/1)^3$).
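
The hypercubic geometry required here can be checked directly. The following is a minimal sketch, assuming the base labels quoted in the abstract, A = (-1,0), C = (0,-1), G = (0,1), U = (1,0); the names `BASE_COORDS`, `codon_weight` and `rotate_pairs` are illustrative and not taken from the paper. Rotating each pair of labels by 45 degrees (up to scale) turns every coordinate into +1 or -1, so the 64 codon weights can be compared with the vertices of a six-dimensional hypercube.

```python
from itertools import product

# Base labels quoted in the abstract: each base is a pair (d, m).
BASE_COORDS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}


def codon_weight(codon):
    """Return the 6 coordinates (d1, m1, d2, m2, d3, m3) of a codon."""
    coords = []
    for base in codon:
        coords.extend(BASE_COORDS[base])
    return tuple(coords)


def rotate_pairs(weight):
    """Rotate each (d, m) pair by 45 degrees, up to scale: (d + m, d - m)."""
    rotated = []
    for i in range(0, 6, 2):
        d, m = weight[i], weight[i + 1]
        rotated.extend((d + m, d - m))
    return tuple(rotated)


codons = ["".join(c) for c in product("ACGU", repeat=3)]
weights = {codon: codon_weight(codon) for codon in codons}

# Compare the rotated weights with the full set of sign vectors in 6 dimensions.
rotated = {rotate_pairs(w) for w in weights.values()}
is_hypercube = rotated == set(product((-1, 1), repeat=6))
print(len(set(weights.values())), is_hypercube)
```

The check uses only the labelling itself, in keeping with the point that the construction does not depend on a particular choice of algebra or superalgebra.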

The orthogonal algebra $so(14)$ has been suggested (Schlesinger et al., 1998) as a unifying
scheme for variants of the $sp(6)$ models (Hornos and Hornos, 1993; Forger et al., 1997). However,
from the present perspective, it is sufficient to take spinor representations of the rank-6 odd
orthogonal algebra $so(13)$ which have dimension 64. Consider the subalgebra chain

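$$so_{13} \supset so_4^{(2)} \times so_9,$$
$$so_9 \supset so_4^{(1)} \times so_5,$$
$$so_5 \supset so_4^{(3)} \simeq su_2^{(3)} \times su_2^{(3)'}, \quad \text{or}$$
$$so_5 \supset so_4^{(3)} \supset su_2^{(3)} \times u_1^{(3)'}$$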

where superscripts indicate base letter. The 64-dimensional representation splits into four 16-plets
at the first breaking stage (the four families labelled by second codon base letter, the latter
being distinguished as a spinor $(\tfrac12,0) + (0,\tfrac12)$ of $so_4^{(2)}$). The same pattern repeats for the first
codon base $so_4^{(1)}$, providing a complete labelling of the 16 family boxes (fixed first and second
base letter). The last stage gives two possible alternatives for the third base symmetry breaking:
in the first, each family box would split into two doublets $(\tfrac12,0) + (0,\tfrac12)$ of $su_2^{(3)} \times su_2^{(3)'}$,
corresponding to a perfect 32-amino-acid code, $4 \to 2 + 2$, or to Y/R degeneracy in anticodon
usage; in the second case, breaking of $su_2^{(3)'}$ to $u_1^{(3)'}$ yields a family box assignment
$(\tfrac12,0) + (0,+\tfrac12) + (0,-\tfrac12)$ coinciding with a 48-amino-acid code, $4 \to 2 + 1 + 1$, or to perfect
Y-degeneracy and R-splitting in amino acid usage. In the eukaryotic code, the $4 \to 2 + 1 + 1$
family-box pattern of anticodon usage is seen, whereas in the vertebrate mitochondrial code, only
partial $4 \to 2 + 2$ family box splitting of anticodon usage is found (see below). Finally, the above labels are all
(up to normalisation) of the form (0, ±1) or (±1,0) for each base letter (or (±1,±1) for the 
third base for one branching) showing that this group-theoretical scheme does indeed give a 
hypercubic geometry for the codon weight diagram. 
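
The multiplet counts quoted above (four 16-plets, 16 family boxes, and 32 or 48 multiplets after the last stage) can be reproduced by simple counting. The sketch below is purely combinatorial and makes no use of the algebra itself; the function name `multiplet_label` and the pattern strings are hypothetical, and the pyrimidine doublet in the 2 + 1 + 1 pattern follows the "Y-degeneracy and R-splitting" wording of the text.

```python
from collections import Counter
from itertools import product

PYRIMIDINES = {"C", "U"}

codons = ["".join(c) for c in product("ACGU", repeat=3)]

# Stage 1: the second base letter labels four families of 16 codons.
families = Counter(codon[1] for codon in codons)

# Stage 2: first and second letters together label 16 family boxes of 4.
boxes = Counter(codon[:2] for codon in codons)


def multiplet_label(codon, pattern):
    """Label the multiplet of a codon for a given third-base breaking pattern."""
    box, third = codon[:2], codon[2]
    if pattern == "2+2":
        # Y/R doublets: pyrimidine pair and purine pair.
        return (box, "Y" if third in PYRIMIDINES else "R")
    if pattern == "2+1+1":
        # Pyrimidine doublet kept, purines split into two singlets.
        return (box, "Y" if third in PYRIMIDINES else third)
    raise ValueError(pattern)


n_22 = len({multiplet_label(c, "2+2") for c in codons})
n_211 = len({multiplet_label(c, "2+1+1") for c in codons})
print(sorted(families.values()), len(boxes), n_22, n_211)
```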

The $sl(6/1)$ superalgebra was advocated in a survey of possible Lie superalgebras relevant to
the genetic code (Bashford et al., 1997, 1998), and possesses irreducible, typical representations
of dimension 64 which share many of the properties of spinor representations of orthogonal Lie
algebras (in the family $sl_{n/1}$ of Lie superalgebras this class of representations has dimension $2^n$
and so can be compared with spinors of the even- and odd-dimensional Lie algebras of rank
$n$, namely $so_{2n}$ and $so_{2n+1}$ respectively). The superalgebra branching chain related to the
$so(13)$ chain described above is

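$$sl_{6/1} \supset sl_2^{(2)} \times sl_{4/1},$$
$$sl_{4/1} \supset sl_2^{(1)} \times sl_{2/1}^{(3)},$$
$$sl_{2/1}^{(3)} \supset sl_{1/1}^{(3)}, \quad \text{or}$$
$$sl_{2/1}^{(3)} \supset sl_2^{(3)} \times u_1^{(3)}$$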

where the last two steps correspond as above either to family box breaking to Y/R doublets
(as in many of the anticodon assignments of the vertebrate mitochondrial code) or to a
$4 \to 2 + 1 + 1$ pattern (as in the anticodons of the eukaryotic code). The nature of the weight
diagram follows from knowledge of the branching in each of the above embeddings. In fact both
in the decomposition of the irreducible 64 to families of 16, and in that of the 16 to family
boxes of 4, there are a doublet and two singlets of the accompanying $sl_2^{(2)}$ and $sl_2^{(1)}$ algebras,
so that the diagonal Cartan element (magnetic quantum number) has the spectrum $0, \pm\tfrac12$. A
second diagonal label arises because there is also an additional commuting $U_1$ generator at each
stage with value $\pm 1$ on the two singlets and $0$ on the doublet. Alternatively, the additional
label may be taken as the $\pm 1$ or $0$ shift in the noninteger Dynkin label of the commuting $sl_{n/1}$
algebra ($n = 4$ and $n = 2$ respectively). Similar considerations apply to the last branching stage
(Bashford et al., 1998), so that again the weight diagram has the hypercubic geometry assumed
in the text of the paper.

In the dynamical symmetry algebra approach to problems of complex spectra, important
physical quantities such as the energy levels of the system, and the transition probabilities for 
decays, are modelled as matrix elements of certain operators belonging to the Lie algebra or 
superalgebra. In particular, the Hamiltonian operator which determines the energy is assumed 
to be a linear combination of a set of invariants of a chain of subalgebras
$G \supset G_1 \supset G_2 \supset \cdots \supset T$:

$$H = c_1 \Gamma_1 + c_2 \Gamma_2 + \cdots + c_T \Gamma_T$$

for coefficients $c_i$ to be determined. For states in a certain representation of the algebra $G$, the
energy can often be evaluated once the hierarchy of representations of
$G \supset G_1 \supset G_2 \supset \cdots \supset T$ to
which they belong is identified, as the invariants are functions of the corresponding representation
labels.

As has been emphasised above, the discussion of fitting of codon and amino acid properties in 
the main body of the paper is independent of specific choices of Lie algebras or superalgebras. In 
fact, the polynomial functions of the 6 codon coordinates may simply be regarded as generalised 
invariants of the smallest subalgebra common to all cases, namely the 6-dimensional Cartan 
(maximal abelian) subalgebra $T$ (so that there are several nonzero coefficients $c_T$, with all other
$c_j$ zero). This approach is thus complementary to detailed applications of a chosen symmetry
algebra, where the coefficients $c_j$ (including $c_T$) might accompany a specific set of $\Gamma_j$ (functions
of the whole hierarchy of labels, whose form is fixed, depending on the subalgebra). However,
because the weight labels used in the present work already provide an unambiguous identification 
of the 64 states, such functions of any possible additional labels are in principle determined as 
cases of the general expansions we have been studying. For this reason the present work, although 
deliberately of a generic nature, does indeed confirm the viability of the dynamical symmetry 
approach. 
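
The statement that the weight labels identify the 64 states unambiguously, so that any function of further labels is a case of the general polynomial expansion, can be illustrated numerically. The sketch below assumes the base labels from the abstract and an illustrative per-letter basis of monomials 1, d, m and d*d (this particular basis is an assumption made for the example, not necessarily the set used for the fits in the paper). It builds the 64 x 64 matrix of products of these monomials evaluated on all codons and computes its rank exactly; full rank means that every codon property has a unique expansion in this monomial set.

```python
from fractions import Fraction
from itertools import product

BASE_COORDS = {"A": (-1, 0), "C": (0, -1), "G": (0, 1), "U": (1, 0)}

# Assumed per-letter basis functions of (d, m): 1, d, m and d*d.
LETTER_BASIS = (
    lambda d, m: 1,
    lambda d, m: d,
    lambda d, m: m,
    lambda d, m: d * d,
)


def design_matrix():
    """Rows: the 64 codons; columns: products of one basis function per letter."""
    rows = []
    for codon in product("ACGU", repeat=3):
        pairs = [BASE_COORDS[base] for base in codon]
        row = []
        for funcs in product(LETTER_BASIS, repeat=3):
            value = 1
            for func, (d, m) in zip(funcs, pairs):
                value *= func(d, m)
            row.append(Fraction(value))
        rows.append(row)
    return rows


def rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    work = [row[:] for row in matrix]
    r = 0
    for col in range(len(work[0])):
        pivot = next((i for i in range(r, len(work)) if work[i][col] != 0), None)
        if pivot is None:
            continue
        work[r], work[pivot] = work[pivot], work[r]
        for i in range(r + 1, len(work)):
            if work[i][col] != 0:
                factor = work[i][col] / work[r][col]
                work[i] = [a - factor * b for a, b in zip(work[i], work[r])]
        r += 1
        if r == len(work):
            break
    return r


basis_rank = rank(design_matrix())
print(basis_rank)
```

A low-order fit of the kind discussed in the main body then corresponds to keeping only a few columns of this matrix and determining their coefficients by least squares.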



Captions of Tables and figures 

Table 1: Table of dinucleoside properties and predicted values from fits. $G_{37}$: Turner
free-energy parameters at 37 °C in kcal mol$^{-1}$ (Xia et al., 1998); $R_f$: dinucleoside monophosphate
relative hydrophilicity (Weber and Lacey, 1978); $R_x$: dinucleotide relative hydrophobicity (Jungck,
1978).

Table 2: Table of amino acid properties ($P^{\alpha,\beta}$: conformational parameters (Fasman, 1989); $P_{Gr}$:
Grantham polarity (Grantham, 1974); $R_f$: relative hydrophilicity (Weber and Lacey, 1978)).

Figure 1: Least-squares fit (curve) to the Turner free-energy parameters (points) at 37 °C, given
by the fitted equation in the main text. Units are in kcal mol$^{-1}$.

Figure 2: Least-squares fits for dinucleoside $R_f$ (upper) and dinucleotide $R_x$ (lower). Points are
experimental values while the curves are the least-squares fits given by the corresponding
equations in the main text.

Figure 3: 'Weight diagram' for the genetic code, arising as the superposition of two projections
of the 6-dimensional space of codon coordinates onto planes corresponding to coordinates for
bases of the first and second codon letters, and an additional one-dimensional projection along
a particular direction in the space of the third codon base. The orientations of the three
projections are chosen to correspond with the rank ordering of amino acids according to the one-step
mutation rings.

Figure 4: Siemion's interpretation of the weight diagram in terms of the rank ordering of
'one-step mutation rings'. Reproduced from Siemion, 1994a. 

Figure 5: Least-squares trigonometric fit (dots) to the $P^\alpha$ conformational parameter (small
circles) as a function of mutation angle $k$. Crosses ("pref") denote the fitted function, given in
the main text, evaluated at preferred codon positions.

Figure 6: Estimated trigonometric fit (dots) to $P^\beta$ (small circles) as a function of $k$. Crosses
("pref") denote the fitted function, given in the main text, evaluated at preferred codon positions.

Figure 7: Polynomial fits (black circles) to the $P^\alpha$ conformational parameter (white circles) as
a function of the six codon coordinates. The fit is given by the fitted equation in the main text.
Crosses ("modif"): same function, with the di^id-i - m^) coefficient modified from -0.075 to -0.2
to enhance eight-codon periodicity.

Figure 8: Least-squares fit (5 parameters) to the $P^\beta$ conformational parameter as a function
of the six codon coordinates. Small circles: data; dots: least-squares fit given in the main text.

Figure 9: Least-squares fit (4 parameters) to relative hydrophilicity (Weber and Lacey, 1978)
as a function of the six codon coordinates. Small circles: data; dots: least-squares fit given in
the main text.

Figure 10: Least-squares fit (2 parameters) to Grantham polarity (Grantham, 1974) as a
function of the six codon coordinates. Small circles: data; dots: least-squares fit given in
the main text.